Implementation of decision tree algorithm dealing with massive noisy data based on Hadoop
LIU Yaqiu, LI Haitao, JING Weipeng
Journal of Computer Applications    2015, 35 (4): 1143-1147.   DOI: 10.11772/j.issn.1001-9081.2015.04.1143
Abstract

Current decision tree algorithms seldom consider how the level of noise in the training set affects the model, and traditional memory-resident algorithms have difficulty processing massive data. To address both problems, an Imprecise Probability C4.5 algorithm, named IP-C4.5, was proposed based on Hadoop. When training the model, IP-C4.5 treats the training set used to build the decision tree as unreliable and uses the imprecise probability information gain ratio as the split criterion, which reduces the influence of noisy data on the model. To handle massive data, IP-C4.5 was implemented on Hadoop using MapReduce programming based on file splits. The experimental results show that when the training set is noisy, IP-C4.5 achieves higher accuracy than C4.5 and Complete CDT (CCDT), with the advantage most pronounced when the noise level exceeds 10%; the Hadoop-based parallel implementation of IP-C4.5 is also able to process massive data.
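
The abstract does not give the formula for the imprecise probability information gain ratio, so the following is only a minimal sketch of how such a split criterion might be computed. It assumes the Imprecise Dirichlet Model (IDM) with hyperparameter s, a common choice in credal decision trees, and replaces the usual entropy with the maximum entropy over the resulting credal set. The function names (`max_entropy_idm`, `imprecise_gain_ratio`), the default s = 1, and the row layout are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def max_entropy_idm(counts, s=1.0):
    """Maximum-entropy distribution in the IDM credal set (assumed model).

    The extra mass s is spread over the smallest counts ("water-filling"),
    which yields the most uniform distribution allowed by the intervals
    [n_i / (N + s), (n_i + s) / (N + s)].
    """
    ordered = sorted(counts)
    total = sum(ordered)
    remaining = s
    level = ordered[0]
    filled = 1  # number of classes currently raised to `level`
    while filled < len(ordered):
        need = (ordered[filled] - level) * filled
        if need >= remaining:
            break
        remaining -= need
        level = ordered[filled]
        filled += 1
    level += remaining / filled
    return [max(c, level) / (total + s) for c in counts]


def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def imprecise_gain_ratio(rows, attr_index, label_index, s=1.0):
    """Gain ratio with class entropy replaced by IDM maximum entropy."""
    labels = [row[label_index] for row in rows]
    classes = sorted(set(labels))

    def upper_entropy(subset):
        tally = Counter(subset)
        return entropy(max_entropy_idm([tally[c] for c in classes], s))

    # Group class labels by the value of the candidate split attribute.
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr_index]].append(row[label_index])

    n = len(rows)
    conditional = sum(len(g) / n * upper_entropy(g) for g in groups.values())
    gain = upper_entropy(labels) - conditional
    split_info = entropy([len(g) / n for g in groups.values()])
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: attribute 0 predicts the label, attribute 1 does not.
rows = [
    ("a", "p", "yes"), ("a", "q", "yes"), ("a", "p", "yes"), ("a", "q", "yes"),
    ("b", "p", "no"), ("b", "q", "no"), ("b", "p", "no"), ("b", "q", "no"),
]
informative = imprecise_gain_ratio(rows, attr_index=0, label_index=2)
uninformative = imprecise_gain_ratio(rows, attr_index=1, label_index=2)
```

Under these assumptions, small or pure-looking branches are pulled toward the uniform distribution, so a split supported by only a few (possibly mislabeled) examples earns less gain than it would under plain C4.5. In a MapReduce setting, the per-attribute class counts that feed `max_entropy_idm` could be accumulated by mappers over file splits and summed by reducers, though the abstract does not describe the job layout in that detail.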
